Dynamic Task Parallelism with a GPU Work-Stealing Runtime System

Authors

  • Sanjay Chatterjee
  • Max Grossman
  • Alina Simion Sbîrlea
  • Vivek Sarkar
Abstract

NVIDIA’s Compute Unified Device Architecture (CUDA) and its C/C++-based API went a long way towards making GPUs more accessible to mainstream programming. So far, the use of GPUs for high performance computing has been primarily restricted to data-parallel applications, and with good reason. The high number of computational cores and the high memory bandwidth supported by the device make it an ideal candidate for such applications. However, its potential for executing applications that combine dynamic task parallelism with data parallelism has not yet been explored in detail, in part because CUDA does not provide a viable interface for creating dynamic tasks and handling load balancing. Any support for dynamic task parallelism has to be orchestrated entirely by the CUDA programmer today. In this work we extend CUDA by implementing a work-stealing runtime on the GPU. We introduce a finish-async style API to GPU device programming with the aim of executing irregular applications efficiently across multiple streaming multiprocessors (SMs) in a GPU device without sacrificing the performance of regular data-parallel applications within an SM. We present the design of our new intra-device, inter-SM work-stealing runtime system and compare it to past work on GPU runtimes, along with performance evaluations comparing execution using our runtime to direct execution on the device. Finally, we show how this runtime can be targeted by extensions to the high-level CnC-CUDA programming model introduced in past work.


Related Articles

Compiler Support for Work-Stealing Parallel Runtime Systems

Multiple programming models are emerging to address an increased need for dynamic task parallelism in multicore shared-memory multiprocessors. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work-stealing, as embodied in Cilk’s implementation of dynamic spaw...


Work stealing for GPU-accelerated parallel programs in a global address space framework

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parall...


Hierarchical Work Stealing on Manycore Clusters

Partitioned Global Address Space languages like UPC offer a convenient way of expressing large shared data structures, especially for irregular structures that require asynchronous random access. But the static SPMD parallelism model of UPC does not support divide and conquer parallelism or other forms of dynamic parallelism. We introduce a dynamic tasking library for UPC that provides a simple...


Elastic Tasks: Unifying Task Parallelism and SPMD Parallelism with an Adaptive Runtime

In this paper, we introduce elastic tasks, a new high-level parallel programming primitive that can be used to unify task parallelism and SPMD parallelism in a common adaptive scheduling framework. Elastic tasks are internally parallel tasks and can run on a single worker or expand to take over multiple workers. An elastic task can be an ordinary task or an SPMD region that must be executed by ...


Work-First and Help-First Scheduling Policies for Terminally Strict Parallel Programs

Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Threading Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work stealing, as embodied in...



Publication year: 2011